(57) Abstract: A method and system are provided
in which a decision tree-based model ("general
model") is scaled down ("trim-down") for a
given task. The trim-down model can be adapted
for the given task using task specific data. The
general model can be based on a hidden markov
model (HMM). By allowing a decision tree-based
acoustic model ("general model") to be scaled
according to the vocabulary of the given task, the
general model can be configured dynamically into
a trim-down model, which can be used to improve
speech recognition performance and reduce system
resource utilization. Furthermore, the trim-down
model can be adapted/adjusted according to task
specific data, e.g., task vocabulary, model size, or
other like task specific data.

METHOD AND SYSTEM TO SCALE DOWN A DECISION TREE-BASED
HIDDEN MARKOV MODEL (HMM) FOR SPEECH RECOGNITION

FIELD OF THE INVENTION 

The present invention relates generally to speech processing and to automatic
speech recognition (ASR) systems. More particularly, the present invention relates to a
method and system to scale down a decision tree-based hidden markov model (HMM)
for speech recognition.

BACKGROUND OF THE INVENTION 

A speech recognition system recognizes a collection of spoken words ("speech")
into recognized phrases or sentences. A spoken word typically includes one or more
phones or phonemes, which are distinct sounds of a spoken word. Thus, to recognize
speech, a speech recognition system must determine relationships between the words
in the speech. A common way of determining relationships between words in
recognizing speech is by using a general-purpose acoustical model ("general model")
based on a hidden markov model (HMM).

Typically, an HMM is a decision tree-based model in which the HMM uses a
series of transitions from state to state to model a letter, a word, or a sentence. Each
arc of the transitions has an associated probability, which gives the probability of the
transition from one state to the next at an end of an observation frame. As such, an
unknown speech signal can be represented by ordered states with a given probability.
Moreover, words in an unknown speech signal can be recognized by using the ordered
states of the HMM. The HMM, however, can place a heavy burden on system
resources.
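
As a rough illustration of how arc probabilities assign a probability to an ordered sequence of states, the following sketch multiplies the transition probabilities along a path (in the log domain). The function name `pathLogProb` and the matrix layout `trans[from][to]` are assumptions made for this example only; it ignores observation likelihoods and is not the decoder used by the described system.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: log-probability of following `path` through an HMM
// whose arc probabilities are stored as trans[from][to].
// Returns -INFINITY if the path uses an arc with zero probability.
double pathLogProb(const std::vector<std::vector<double>>& trans,
                   const std::vector<int>& path) {
    double logp = 0.0;
    for (std::size_t t = 1; t < path.size(); ++t) {
        const double p = trans[path[t - 1]][path[t]];
        if (p <= 0.0) return -INFINITY;
        logp += std::log(p);  // sum of logs == log of the product
    }
    return logp;
}
```

Working in the log domain avoids numerical underflow when many small arc probabilities are multiplied over a long utterance.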

Thus, a challenge for speech recognition systems is how to utilize
resources for improving the performance of using a general model such as the HMM.
A disadvantage of using the general model is that it is trained for broad use from a
very large vocabulary, which can lead to poor performance for special applications
related to specific vocabulary. For example, a mismatch between speaker
characteristics, transmission channels, training data, etc., can degrade speech
recognition performance using the general model. Another disadvantage of the
general model is that it requires extensive computation costs and high resource
utilization.

BRIEF DESCRIPTION OF THE DRAWINGS

The features and advantages of the present invention are illustrated by way of
example and not intended to be limited by the figures of the accompanying drawings
in which like references indicate similar elements, and in which:

FIG. 1 is a diagram illustrating an exemplary digital processing system for 
implementing the present invention; 

FIG. 2 is a flow chart illustrating an operation to scale a general model into a task 
specific model according to one embodiment; 

FIGS. 3A through 3D are exemplary diagrams illustrating how decision trees can
be scaled;

FIG. 4 is a flow chart illustrating a trim-down operation according to one 
embodiment; 

FIG. 5 is a flow chart illustrating a task adapting and interpolating operation
according to one embodiment; and

FIG. 6 is a flow diagram illustrating an interpolation process according to one 
embodiment. 

DETAILED DESCRIPTION 

According to one aspect of the invention, a method and system are provided in
which a decision tree-based model ("general model") is scaled down ("trim-down") for
a given task. The trim-down model can be adapted for the given task using task
specific data. The general model can be based on a hidden markov model (HMM) and
can be trained through a large scale general purpose (task-independent) speech
database.

By allowing a decision tree-based acoustic model ("general model") to be scaled
according to the vocabulary of the given task, the general model can be configured
dynamically into a trim-down model, which can be used to improve speech
recognition performance and reduce system resource utilization. Furthermore, the
trim-down model can be adapted/adjusted according to task specific data, e.g., task
vocabulary, model size, or other like task specific data.

FIG. 1 is a diagram illustrating an exemplary digital processing system 100 for
implementing the present invention. The speech processing and speech recognition
techniques described herein can be implemented and utilized within digital processing
system 100, which can represent a general purpose computer, portable computer,
hand-held electronic device, or other like device. The components of digital
processing system 100 are exemplary in which one or more components can be
omitted or added. For example, one or more memory devices can be utilized for
digital processing system 100.

Referring to FIG. 1, digital processing system 100 includes a central processing 
unit 102 and a signal processor 103 coupled to a display circuit 105, main memory 104, 
static memory 106, and mass storage device 107 via bus 101. Digital processing system 
100 can also be coupled to a display 121, keypad input 122, cursor control 123, hard 
copy device 124, input/output (I/O) devices 125, and audio/speech device 126 via
bus 101. 

Bus 101 is a standard system bus for communicating information and signals.
CPU 102 and signal processor 103 are processing units for digital processing system
100. CPU 102 or signal processor 103 or both can be used to process information
and/or signals for digital processing system 100. Signal processor 103 can be used to
process speech or audio information and signals for speech processing and
recognition. Alternatively, CPU 102 can be used to process speech or audio
information and signals for speech processing or recognition. CPU 102 includes a
control unit 131, an arithmetic logic unit (ALU) 132, and several registers 133, which
are used to process information and signals. Signal processor 103 can also include
similar components as CPU 102.

Main memory 104 can be, e.g., a random access memory (RAM) or some other
dynamic storage device, for storing information or instructions (program code), which
are used by CPU 102 or signal processor 103. For example, main memory 104 may
store speech or audio information and instructions to be executed by signal processor
103 to process the speech or audio information. Main memory 104 may also store
temporary variables or other intermediate information during execution of
instructions by CPU 102 or signal processor 103. Static memory 106 can be, e.g., a
read only memory (ROM) and/or other static storage devices, for storing information
or instructions, which can also be used by CPU 102 or signal processor 103. Mass
storage device 107 can be, e.g., a hard or floppy disk drive or optical disk drive, for
storing information or instructions for digital processing system 100.

Display 121 can be, e.g., a cathode ray tube (CRT) or liquid crystal display (LCD).
Display device 121 displays information or graphics to a user. Digital processing
system 100 can interface with display 121 via display circuit 105. Keypad input 122 is an
alphanumeric input device for communicating information and command selections to
digital processing system 100. Cursor control 123 can be, e.g., a mouse, a trackball, or
cursor direction keys, for controlling movement of an object on display 121. Hard
copy device 124 can be, e.g., a laser printer, for printing information on paper, film, or
some other like medium. A number of input/output devices 125 can be coupled to
digital processing system 100. For example, a speaker can be coupled to digital
processing system 100. Audio/speech device 126 can be, e.g., a microphone with an
analog to digital converter, for capturing sounds of speech in an analog form and
transforming the sounds into digital form, which can be used by signal processor 103
and/or CPU 102, for speech processing or recognition.

The speech processing techniques described herein can be implemented by
hardware and/or software contained within digital processing system 100. For
example, CPU 102 or signal processor 103 can execute code or instructions stored in a
machine-readable medium, e.g., main memory 104, to process or to recognize speech.

The machine-readable medium may include a mechanism that provides (i.e.,
stores and/or transmits) information in a form readable by a machine such as a
computer or digital processing device. For example, a machine-readable medium may
include a read only memory (ROM), random access memory (RAM), magnetic disk
storage media, optical storage media, and flash memory devices. The code or instructions
can be represented by carrier wave signals, infrared signals, digital signals, and
other like signals.

FIG. 2 is a flow chart illustrating an operation 200 to scale a general model into a
task specific model according to one embodiment. Operation 200 includes two parts.
The first part relates to how the general model is scaled by the vocabulary of the given
task to create a trim-down model. The second part relates to adapting the trim-down
model based on task specific data to create a task specific model. These two parts of
operation 200 can be used separately or in combination.

Referring to FIG. 2, at operations 202 and 204, vocabulary knowledge of a given
task and a general model are input to a trim-down operation 206 for reducing the
general model according to the vocabulary of a given or specific task into a "trim-down
model".

The general model may be a general-purpose acoustical model such as the HMM.
For example, the general model may be a decision tree-based HMM. The general
model may include an exhaustive list of words for a given language. The vocabulary
knowledge of the given task, however, may include only a small subset of the general
model. For example, the vocabulary knowledge of the given task may be related to
"weather reporting" and include words and phrases such as, for example, "cloudy,"
"rain today," "temperature," "humidity," "the temperature today will be," etc.

At operation 206, the general model is trimmed down according to the
vocabulary of the given task to create a "trim-down model." The trim-down operation
206 allows a general model to be scaled for use by specific vocabulary instead of
having to use the exhaustive vocabulary of the general model.

At operation 208, the trim-down model is adapted with task specific data to
configure the trim-down model according to specific task requirements such as, for
example, task specific vocabulary, model size, etc. For example, the trim-down model
can be trained for a task dependent model.

At operation 210, a task specific model can be obtained after adapting the trim-down
model. For example, a general model can be scaled down for a given task such
as, for example, weather reporting and adapted with weather reporting specific data to
create a task specific model for weather reporting. Thus, operation 200 provides a fast
and accurate task specific model for recognizing speech of the given task by
downsizing or scaling a general model for the given task.

FIGS. 3A through 3D are exemplary diagrams illustrating how decision trees
can be scaled. For purposes of illustration, the decision trees are trees based on
decision tree state clustering for a general model such as the HMM. In the following
examples of FIGS. 3A through 3D, child nodes or leaf nodes of the same parent node
represent HMM states of similar acoustic realization. Parameters under child or leaf
nodes are clustered down to a smaller size. Each HMM state represents a triphone
unit. A triphone unit includes a phone or phoneme having a left and right context.

Furthermore, each node of the decision trees is related to a question regarding
the phonetic context of the triphone clusters with the same central phone. A set of
states can be recursively partitioned into subsets according to the phonetic questions at
each tree node when traversing the decision tree from the root to its leaves. States reaching
the same child or leaf node on the decision tree are regarded as similar, and the number
of child or leaf nodes determines the number of states.
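
The traversal described above can be sketched as follows. This is a minimal illustration, not the listing from this specification: the `Triphone` and `Node` types and the function `findTiedState` are hypothetical names, and the question stored at each node is modeled as an arbitrary yes/no predicate on the triphone context.

```cpp
#include <functional>
#include <memory>
#include <string>

// Hypothetical triphone: a central phone with left and right context.
struct Triphone {
    std::string left, center, right;
};

// Hypothetical tree node: internal nodes hold a yes/no phonetic question,
// leaves hold the index of a tied (clustered) HMM state.
struct Node {
    std::function<bool(const Triphone&)> question;  // unused for leaves
    std::unique_ptr<Node> yes, no;
    int state = -1;
};

// Walk from the root to a leaf, answering one question per node.
// Triphones that reach the same leaf share the same tied state.
int findTiedState(const Node& root, const Triphone& tri) {
    const Node* n = &root;
    while (n->yes && n->no) {
        n = n->question(tri) ? n->yes.get() : n->no.get();
    }
    return n->state;
}
```

Because every triphone with the same central phone is routed through the same tree, the number of leaves bounds the number of distinct states that must be stored for that phone.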

The following exemplary decision trees illustrate the rules for merging nodes of
a decision tree of a general model. That is, if all the child or leaf nodes from a parent
node do not occur in the given task, the child or leaf nodes will merge with the parent
node. In addition, if all the child or leaf nodes from a parent node do occur in the
given task, the child or leaf nodes will not merge with the parent node. Furthermore,
if any parent node has incomplete descendants (i.e., descendants that do not occur
in the given task), it will be replaced by its child or leaf node that does have
complete descendants.

FIGS. 3A and 3B illustrate that if all of the child or leaf nodes from a parent node do
not occur in the given task, the child or leaf nodes will merge with the parent node.
Referring to FIG. 3A, an exemplary decision tree is shown having a plurality of nodes
302 through 310. A node labeled with a line ("-") through it refers to a node that
occurs in a given task. Node 308 refers to a node that occurs in the given task. For
example, node 308 may refer to the word "cloudy" for the given task of "weather
reporting." For one implementation, during a trim-down operation, nodes like nodes
308 and 310 will merge with their parent node 304 because not all of the child or leaf
nodes from parent node 304 occur in the given task. Referring to FIG. 3B, another
exemplary decision tree is shown having a plurality of nodes 312 through 320. The
decision tree has no nodes labeled with a line ("-") through them. Thus, for another
implementation, during a trim-down operation, child or leaf nodes like nodes 318 and
320 will merge with their parent node 314 because they do not occur in the given task.

FIG. 3C illustrates that if all of the child or leaf nodes from a parent node do occur in
the given task, the child or leaf nodes will not merge with the parent node. Referring
to FIG. 3C, an exemplary decision tree is shown having a plurality of nodes 322
through 330. A node labeled with a line ("-") through it refers to a node that occurs in
a given task. The nodes 328 and 330 refer to nodes that occur in the given task.
Thus, for another implementation, during a trim-down operation, nodes 328 and 330
will not merge with their parent node 324.

FIG. 3D illustrates that if any parent node has incomplete descendants (i.e.,
descendants that do not occur in the given task), it will be replaced by its child or leaf
node that does have complete descendants. Referring to FIG. 3D, an exemplary
decision tree is shown having a plurality of nodes 332 through 342. The parent node
333 has child or leaf nodes that are incomplete; however, child or leaf node 338 has
complete descendants, i.e., nodes 340 and 342. Thus, node 338 will merge with parent
node 333. Node 336 may also merge with parent node 333 following the rules set forth
in FIGS. 3A and 3B.

FIG. 4 is a flow chart illustrating in further detail the trim-down operation 206
of FIG. 2 according to one embodiment. Referring to FIG. 4, at operation 402, a
decision tree of a general model is obtained. The decision tree can be similar to the
exemplary decision trees shown in FIGS. 3A through 3D.

At operation 404, leaf nodes of the decision tree are labeled if they occur in the given task
vocabulary. For example, as shown in FIG. 3A, node 308 is labeled as having occurred
in the given task.

At operation 406, leaf nodes are merged if necessary. For one implementation,
as illustrated in FIGS. 3A and 3B, if all of the child or leaf nodes from a parent node do
not occur in the given task, the child or leaf nodes will merge with the parent node.
For another implementation, as illustrated in FIG. 3C, if all of the child or leaf nodes
from a parent node do occur in the given task, the child or leaf nodes will not merge
with the parent node. For another implementation, as illustrated in FIG. 3D, if any
parent node has incomplete descendants (i.e., descendants that do not occur in the
given task), it will be replaced by its child or leaf node that does have complete
descendants.
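
One possible reading of the labeling and merging rules of operations 404 and 406 is sketched below as a post-order pass over a binary tree. The `TreeNode` type, the `seen` flag, and the functions `trim` and `countLeaves` are hypothetical and are not taken from the exemplary listing later in this description; in particular, treating "a leaf that does not occur in the task next to a subtree that is kept" as the FIG. 3D case is an assumption of this sketch.

```cpp
#include <memory>
#include <utility>

// Hypothetical node type: `seen` marks a leaf whose state occurs in the
// task vocabulary (the label assigned at operation 404).
struct TreeNode {
    std::unique_ptr<TreeNode> yes, no;
    bool seen = false;
    bool isLeaf() const { return !yes && !no; }
};

// Post-order trim. Assumes every internal node has exactly two children.
void trim(std::unique_ptr<TreeNode>& node) {
    if (!node || node->isLeaf()) return;
    trim(node->yes);
    trim(node->no);
    const bool yesLeaf = node->yes->isLeaf();
    const bool noLeaf = node->no->isLeaf();
    if (yesLeaf && noLeaf) {
        // FIG. 3C: both leaves occur in the task, so they stay separate.
        if (node->yes->seen && node->no->seen) return;
        // FIGS. 3A and 3B: otherwise the leaves merge into the parent.
        node->seen = node->yes->seen || node->no->seen;
        node->yes.reset();
        node->no.reset();
    } else if (yesLeaf && !node->yes->seen) {
        // FIG. 3D: the parent is replaced by the child that is kept.
        node = std::move(node->no);
    } else if (noLeaf && !node->no->seen) {
        node = std::move(node->yes);
    }
}

// Number of leaves, i.e. the number of tied states left after trimming.
int countLeaves(const TreeNode* n) {
    if (!n) return 0;
    if (n->isLeaf()) return 1;
    return countLeaves(n->yes.get()) + countLeaves(n->no.get());
}
```

The sketch only restructures the tree. In a full trim-down operation the parameters of merged leaves would also have to be combined into a single distribution for the surviving node.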

The merged nodes in the trim-down operation 206 can be represented by mixed
Gaussian distributions to achieve high performance in speech recognition systems. For
example, given two leaf ("child") nodes $n_1$ and $n_2$ with Gaussian distributions
$G_1 = N(\mu_1, \Sigma_1)$ and $G_2 = N(\mu_2, \Sigma_2)$, respectively, using a Baum-Welch ML estimate,
the following two equations can be used:

$$\mu_1 = \frac{1}{a}\sum_{x \in X} \gamma(x)\, x, \qquad \Sigma_1 = \frac{1}{a}\sum_{x \in X} \gamma(x)\,(x - \mu_1)(x - \mu_1)^T \qquad (1)$$

where $X$ = {speech data aligned to Gaussian $G_1$ with occupancy count $\gamma(x)$ for
each data $x$}, and $a = \sum_{x \in X} \gamma(x)$ is the total occupancy of Gaussian $G_1$ in the training data.

Furthermore, assume both sets of data $X$ and $Y$ are modeled by the combined
Gaussian $G = N(\mu, \Sigma)$ if leaf nodes $n_1$ and $n_2$ are merged together.

The merged mean and variance can be computed as follows:

$$\mu = \frac{a}{a+b}\,\mu_1 + \frac{b}{a+b}\,\mu_2$$

$$\Sigma = \frac{a}{a+b}\left[\Sigma_1 + (\mu_1 - \mu)(\mu_1 - \mu)^T\right] + \frac{b}{a+b}\left[\Sigma_2 + (\mu_2 - \mu)(\mu_2 - \mu)^T\right]$$

where $b = \sum_{y \in Y} \gamma(y)$ is the total occupancy of Gaussian $G_2$ in the training data.
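
For diagonal covariances, the merge can be sketched per dimension as below. This is an illustration under stated assumptions, not the exemplary listing: the `DiagGaussian` type and `mergeGaussians` function are hypothetical names, and the update is the standard occupancy-weighted moment matching, in which each variance is increased by the squared offset of its mean from the merged mean.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for a diagonal-covariance Gaussian and its
// total occupancy count in the training data.
struct DiagGaussian {
    std::vector<double> mean;
    std::vector<double> var;
    double occ;
};

// Merge two Gaussians of equal dimension by occupancy-weighted
// moment matching (a and b are the occupancies of g1 and g2).
DiagGaussian mergeGaussians(const DiagGaussian& g1, const DiagGaussian& g2) {
    const double a = g1.occ, b = g2.occ, total = a + b;
    DiagGaussian g;
    g.occ = total;
    g.mean.resize(g1.mean.size());
    g.var.resize(g1.var.size());
    for (std::size_t i = 0; i < g1.mean.size(); ++i) {
        g.mean[i] = (a * g1.mean[i] + b * g2.mean[i]) / total;
        const double d1 = g1.mean[i] - g.mean[i];
        const double d2 = g2.mean[i] - g.mean[i];
        g.var[i] = (a * (g1.var[i] + d1 * d1) + b * (g2.var[i] + d2 * d2)) / total;
    }
    return g;
}
```

For instance, merging two unit-variance Gaussians with means 0 and 2 and equal occupancy gives a merged mean of 1 and a merged variance of 2: the extra variance accounts for the spread between the two original means.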

FIG. 5 is a flow chart illustrating a task adapting and interpolating operation 500
according to one embodiment. Referring to FIG. 5, at operation 502, a trim-down
model of a general model is obtained. For example, the operation 206 can be used to
obtain a trim-down model.

At operation 504, the trim-down model is trained with task specific adaptation
data. The task adaptation data can be used to adapt the general model (using the trim-down
model) for the vocabulary of the given task. At operation 506, a task dependent
model is derived after the training process of operation 504 using the trim-down
model and the task specific adaptation data.

At operation 508, the trim-down model and task dependent model can be
interpolated. The interpolation process can use an approximate maximum a posteriori
(AMAP) estimation process as will be explained in FIG. 6.

At operation 510, a specific task model is generated after interpolating the trim-down
model and the task dependent model.

FIG. 6 is a flow diagram illustrating an interpolation process (AMAP) 600
according to one embodiment. Referring to FIG. 6, at functional block 602, adaptation
data is input to a Baum-Welch process to obtain a posterior probability. The posterior
probability can be obtained as follows.

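Suppose that $\gamma_t(s_t)$ is the probability of being at state $s_t$ at time $t$ given the
current HMM parameters, and $\phi_{it}(s_t)$ is the posterior probability of the $i$th mixture
component:

$$\phi_{it}(s_t) = \frac{p(\omega_i \mid s_t)\, N(x_t; \mu_i, \Sigma_i)}{\sum_j p(\omega_j \mid s_t)\, N(x_t; \mu_j, \Sigma_j)}$$

At functional blocks 606, 608, and 610, an approximate MAP (AMAP)
estimation scheme can be implemented by linearly combining the general model and a
task dependent model (TD model) with task specific counts for each component
density:

$$\langle x \rangle_i^{AMAP} = \lambda \langle x \rangle_i^{G} + (1-\lambda) \langle x \rangle_i^{TD}$$
$$\langle x x^T \rangle_i^{AMAP} = \lambda \langle x x^T \rangle_i^{G} + (1-\lambda) \langle x x^T \rangle_i^{TD}$$
$$n_i^{AMAP} = \lambda\, n_i^{G} + (1-\lambda)\, n_i^{TD}$$

where the superscripts on the right-hand side (here $G$ for the general model and $TD$ for
the task dependent model) denote the data over which the
following statistics (counts) are collected during one iteration of the forward-backward
algorithm:

$$\langle x \rangle_i = \sum_t \gamma_t(s_t)\, \phi_{it}(s_t)\, x_t$$
$$\langle x x^T \rangle_i = \sum_t \gamma_t(s_t)\, \phi_{it}(s_t)\, x_t x_t^T$$
$$n_i = \sum_t \gamma_t(s_t)\, \phi_{it}(s_t)$$

and where the weight $\lambda$ controls the adaptation rate. Using the combined counts,
we can compute the AMAP estimates of the means and covariances of each Gaussian
component density from

$$\mu_i^{AMAP} = \frac{\langle x \rangle_i^{AMAP}}{n_i^{AMAP}}$$

$$\Sigma_i^{AMAP} = \frac{\langle x x^T \rangle_i^{AMAP}}{n_i^{AMAP}} - \mu_i^{AMAP} \left(\mu_i^{AMAP}\right)^T$$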

At functional block 612, a task adaptation model or a task specific model is
obtained by the combination of the general model and the task dependent model.

The following is exemplary high level code for implementing a trim-down
operation to scale a general model. For example, the trim-down operation as
described herein can be implemented using the following exemplary code.
Furthermore, the trim-down operation can be implemented with any standard
programming language such as, for example, "C" or "C++."

#include "hmm/hmmphys.h"
#include "hmm/hmmparam.h"
#include "hmm/hmmmap.h"
#include "hmm/hmmlist.h"
#include "hmm/phnenc.h"
#include "hmm/phnname.h"
#include "hmm/mono.h"
#include "dcstree/question.h"
#include "statis/statis.h"
#include "dcstree.h"
#include "dsizemodel_voc.h"
#include "logmath.h"
#include "exscan.h"

extern HMMPhysical *hmmphys;
extern HMMParam *hmmparam;
extern HMMMap *hmmmap;
extern DcsTreeQuestion *dtq;
extern STATIS *hmmocc;
extern ExStateScan *exscan;
extern MonoPhone *mono;
extern float *newWgtSet;
extern float *newMeanSet;
extern float *newVarSet;

extern int *newNmixSet;
extern int *newMixSet;

/* This function combines the nodes with none or only one child node that can be traversed from the
vocabulary list */

int I^izeModeL-:VodSlergeModes.M(aDcsTVeeNode '^lode^nt magic^ newstate) { 
printfC*entering node%d\n"Atode->id); 
node->magic=inagic; 
if (node->VocLjExisl>s^) return newstate; 
if {node->VocLExisl==-l) { 

if (!node->yes) return newstate; 
■ if (node->yes->VocL.Exist==-l) 

iiewstate=VocMergeNodes_M(node->yes^gicpMWState); • 
if (nodeT>no->Voc_Exist=-l) 
newstate=VocMergeNodesJM{node->no^agic^ewslate); 

} 

if ({node->yes->VocJSxist>=2) | | (node->no->Voc_Exist>=2)) | 
node->VocLjExisfcsaiode->yes->VocLPxist+nod^ 
return newstate; 

1 

if ((node->yes->VocJxis^l)&&{node->no->Voc.,^xist=l)) { 
node->Voc_Exist=2; 
return newstate; 

1 

if ((node->yes->VocJSxist>=0)&&(node->no->Voc_Exist>=0)) { 
float cost=MexgeGaussianCost(node->ye5^ode->no); 
//if (cost>=0^e+10) { printfCcost=%132f\n*)pretum newstate;} 

printffMerge node%d and node%d^ the lost cost=%102f\nVu)de->yes->i4,node->no->id/X)sl 
for (int sss=0; sss<nini3g SSS++) { 
FinalCGLTopl2lsss]xoinl>Cost==CGLTopl2[sss].combCost; 
FinalCGLTopl2[sss]xoinbWdgJit=<:GLTopl2tsss]uxfl^ 
for (int ttt==0,-ttt<DSvecSize,ttt++) { 
FinalCGLTopl2[ss5lxombMeaiiIttt]<:GLTopl2[sss]xoinbMe^ 
FinalCGLTopl2[ssslxombVar[tttl=CGLT<qpl2[ssslxombVar^ 

\ 

1 ' 
}

CopyPara2nl(node->yes^ode^>no^ewstate); 

pikitfCnew state %d is gjenerated successftJlylW^aewstate); 

node->VocLExist==«u)de->y€&->Voc_Exist+node->i^ 

node->state=newstate+newstatebase; 

node->occ=node->yes->occ+node->no->occ; 

newstate4+; 

return newstate; 

1 

/* This function marks the nodes with one child node's Voc_Exist==0 */

aDcsTieeNode * DSizeModek:Vo(Mergd>Todes..^efine(aDcsTr^^ *node^t xnagic^aDcsTreeNode 
*teinp jiode) { 

priritf("entering node%dW>ode->id); 

node->iiiagic=magi<;; 

if (node->VocLExist<=l) return temp„node; 
if (node->Voc3a5t>«2) | 

if (lnode->yes) { printfi("Error\n*0; return temp jnode;} 

tenipjiode=VocMergeNcKies_Refine(node->yes,magic,tempjaode)^^ 

temp jiode=Vo<MergeNodes Jtefirffi(node->no^gic,tempjio^^ 

if (((node->yes->Vo<JExist>=2)&&(nDde->no->VocLJExist^ |((node->no- 
>Voc_Exist>=2)&&(node->yes->VocJSxist==0))) { 
if (temp_node) { 

printfTStarting refineinent%d node is being replaced by %d nodeW, node->id, temp jiode->id); 
node->occ=node->yes->occ+node->no->occ; 

node->VocLExist;si000; / * mark with a large number which can not be achieved in general cases V 
retuxnnode; 

return tenq> jnode; 

) 

if ((node->yes->VocJExist«l)&&<node->no->Vo<UBxist=l)) { 
retuxnnode; 

1 

retuxnnode; 

float DSizeModd::MergeGaussianCo5t(aDcsTreeNode *a,a^ 
iht e^dxl=sa->state; 
int esidx2=b->state; 

float alj7l,minimuxAXiaxmxdx; 

intnuxl,m]x2; 

intnmixlpimix2; 

float '^e^rdJ^BilJ^sm2,'^3i2^me^ 
int *tiewnmix;!™wNmixSel; 
bit ^tiewxnixssnewMbcSet; 
if (esidxl<new5tatebase) { 

if (a->occ<=0.0) a->occ«hmmocc->GetOcct(esidxl); 

mixl=hmmparam->ExState(esidxl)->mix; 

nmixl=hiQnmparam->ExState{esidxl)->nmix; 

meanlshmmparam->Mean(mixl); 

varl=hmmpaxazxir>Var(mixl); 
)else{ 

nmixl=newninix[esidxl-newstatebasel; 

meanXjrs=newMeanSet+newmix[esidxl-newstatebase]*DSvecSiz^^ 
varl..;rsnewVaiSet4^newmix[e5icbd-iiew5tateba^^ 

1 

if (esidx2<newstatebase) ( 
if (b->occ<=0.0) b->occ=^iirunocc->GetOcct(esidx2); 

mix2=hmmparam->ExState(esidx2)->mix;
nmix2=hmmparam->ExState(esidx2)->nmix;
mean2=hmmparam->Mean(mix2);
var2=hmmparam->Var(mix2);
}else{ 

ranix2==newranix[esidx2-iiewstatebasel; 
vai2jct™wVarSet4«evmux(esidx2-ne^ 

} 

float ♦wgtl,*wgt2; 

if (esidxl<newstateba5e) 

wgashmmparaznr>Wdg^t(luximpara]iir> 
else 

wgtl==newWgtSet+iiewiriix[esi<to-newstatebaseL' 
if (esidx2<new5tatebase) 

wgt2=hininparam->Wd^t(hzmnparam->ExState(e5id^ 
else 

wgt2=newWgtSet+newinix[esicbc2-newstatebasel; 
float aa/bb; 
float SiunVar[36]; 

inirininixs(ninixl>imux2)?mnixluimi^ 
inaxnmix==(ninixl>nmix2)?raM)dat^^ 
if ((iimix!=raiixl) I I (ninix!=iimix2)) { 
pnntff*ex%d and ex%d*5 Mbctuxe Nuznber xnismatdv they are %d and 
%d\n"^idxl,esidx2,iunixl,ziinix2); 

Here we only choose the state with max Mbcture Number's parameter as the newstate's parameter 

V 

f or (int i=0A<maxnmix4++) { 
if (ximix2>nmixl) { 
if (esidx2<new5tatdbase) { 
mean2=hmmparam->Mean(mix2+i); 
var2=hmiixparam-'>Var(inix2+i); 
}else{ 

mean2=:mean2^+i*DSyecSize; 
vai2=var2j+i*(DSvecSize+l); 

} 

CGLTopl2Ii].combWeight=wgt2[i]; 
. for (int j=0,j<DSvecSize;j-H-) { 

CGLTopl2[i].combMeanQ]smean2|jl; 
CGLTopl2Ii].aHrf)Var01=?^ 

1 

lelse{ 

if (esidxl<newstatebase) { 

meanl=hinmpaiam->Mean(mixl+i); 

varl=hinmparam->Var(mixl+i); 
}elsej 

meanl=3iieanl_r+i*DSvecSize; 
varl=*varljr+i*(DSvecSize+l); 

1 

CGLTopl2[U.combWeigjht=wgtl[il; 
for (int j=:Og<DSvecSize;j++) I 

CGLTopl2[i].combMeanIjl=meaiai3]; 

CGLTopl2[a.combVar|jl=svarl|j+ll; 

1 

} 

1 

return -1.0; 
} • 
for (int ii=0;ii<nmix;ii++) {
aa=exp(wgt1[ii]);

a1=a->occ*aa;

if (esidxl<new5tatebase) { 

meanl=hinmparam->Nfean(inixl+ii); 

varl=hnimparainr>Var(inixl+ii); 
}else{ 

meanl==meaxLU:+ii*DSvecSize; 
varl=varlj:+ii*(DSvecSi2e+l); 

} 

for (int j==Ogj<ninix,j[j++) ( 
bb=exp(wg;l2Ijj]); 
bl=ib->occ*bb; 
if (esidx2<newstatebase) ( 

mean2shnunparaxiH>Mean(znix24-j^^ 

v2ur2=hiiimparam->Var(iiibc2+ij); 
} else { 

ixiean2=nneaii2jr+jj*DSvecSi2e; 
var2=vai2jr+ij*(DSvecSize+l); 

1 

CGL[ii*iiniix+jn.combCosb=0.0; 
if(bhat){ 
for (int i=0A<DSvecSize4++) { • 
SumVar[i]=05*(1.0/varlIi+l]+1.0/vai2Ii+lD; 

} 

) 

for (int i=04<DSvecSize4++) { 
CGL[ii*mnix4jj].combMeanIiHal*meanl[il+^ 

CGL[ii*tiiiux4fllxonAVar[il=(al*(1.0/varl[i^^ . 
(irieaxa[i1<:GL[ii*turilx+sl.coii^ 
(ineait2[i]-CGL[ii'^nmix+gl.coinbMeanti])))/ 

CGL[ii*mnix+ijl.combWd5ht=(al+bl)/(a->occ4^>^ 

if{rbhat){ 

CGLIii*mnix4jflxombCost+Kal+bl)nog(C^^ 
-ji)l*log(vai2[i+lD; 
) 

/♦ Bhattacharyya Distance V 
else{ 

CGL[ii*lmux4ij]xombCost+=<)5*log(SumVar[i])H^^*ao^^ 
CCa*(ii'toix4|l.conibCost+M}.125*(meanl 
^^ean2[iD/SmnVa^[il+05»los(Sumyar[i])^^055»^^^^ 

} 

} 

} 

1 

/* sort the ninix*turax distances to ninix V 
for (iilt topranix==0,iopninix<nniix;tqpnn^ { 

idxl[topziinix]'=ninix-)-l; 

idx2[topmnix]snniix+l; 

1 

bool notfirsbsfalse; 
boolRepstrue; 

for (int topmxuxM}/*topninix<3unix;topnniix++) { 
float zninCostsl.Oe+10; 
Rep=true; 

f or (int iii=0/*iii<nmixpii++) I 
for (int S=0,;ffl<nnuxgij++) | 
if (notficst)! 

for (int couzittopn==0;coimttopn<topninix;;coimttapn++) { 

if (idx1[counttopn]==iii) Rep=false;

I 

, for (int coimttopn=^;counttopn<topnmixxounltopn++) { 
if (idx2[counttopn]===jg) Rep=false; 

1 

if ((GGL[iii*liinix+jg].combCost<m^ 
nunCost^GL[iii'hm:nix+jjj]xon^ 
idxl[t(qminix]sdii4cbc2[topxuxux]=^ 

1 

Repstrue; 

} . 
else { 

if ((CGL[iii*tiiiux+]g]xoii*Co^^ 
ininCost=<:GL[iii*iunix4jij]-combCost; 
idxl[topnxnix]=dii4dx2[tx>pnx^ 
//not&st=strue; 

1 

1 

} 

I 

notfirst=true; 

CGLTopl2[topnmix]xoinbCost==2iunCost; 
for (kit ttt=0,'ttt<DSvecSi2e;ttt++) { 

CGLropl2[topiuiux]xonibMean[tttI=CGLtidxl[fopn^^ 

CGLTopl2[topniiuxIxombVar[ttt]=1.0/CGL[idxl[^^ 

) 

CGLTopl2[topiunix]xombWdghts=log(CGL[idxl[topimiix]*t^^ 

1 

, CGLTopl2liuidxI.combCost==0.0; 
for (int i=0; i<niiiSx; 

CGLTopl2[niraxlxoiiibCostH^GLTopl2[i],combCost 
return CGLTopl2[muxlxombCost; 

1 . 

void D5izeModekKZopyF^aaml(aDc^reeNode *nl, aDcsTreeMode *r2, int newstate) { 
int esidxl=snl->state; • 
int esidx2=?n2->state; 
int n2nixl^iimix2,inaxnmix; 
int "^ncwnrnixsTOwNxnixSeb 
if (esidxl<newstateba5e) { 

nnuxl=hinmpaxam->E)£tate(esidxl)->nznix; 
}else{ 

n]iiixl=^newiunix[esidxl-new5tatebase]; 

1 

if (esidx2<newstatebase) { 

ninix2shirimparam->£xState(esidx2)->ninix; 
}else{ 

nznix2s^newnzxiix[esidx2-newstatebase]; 

1 

inaxranix^nxnixl>timbc2)?n]xiixl3imix2; 
//printf("ninaxinix=%d\n"Aonaxmix); 
int *tiewmix; 
if (newstate=0) { 

newmixsnewMixSet; 

newinix[new5tate}=0; 
}else{ 

newxnix=:newMixSet; 

z\ewxxiix[newstate]=4iewxnix[new5tate-ll+newniii^ 

1 

newnxnix[new5tate]sinaxiizni>Q 

float *newmean=newMeanSet+newmix[newstate]*DSvecSize;
float *newvar=newVarSet+newmix[newstate]*(DSvecSize+1);
float *newwgt=newWgtSet+newmix[newstate];
for (int i=04<inaxninix4++) { 
newgt(i]=FinalCGLTopl2[iIxoinbWeight; 
float tempvar=O.0; 
for (int j=:03<DSvecSize,-(++) { 
newmean[j]=FinalCGLTopl2[i].combMeanBl; 
newvarI]+ll=FinalCGLTopl2[i].coinbVarlj]; 
tempvar-l«log(2.0*MJ^float)log((double)n^^ 

1 

newvar[0]=tempvar; 

newmean+^DSvecSize; 

newvar+=pSvecSize+l); 

) 

) • 

int IDSizeModeL:GaiissiaanDistancfiLJVOc(^^^ I 
printf Centering \n"); 

printfCxiewstatebase=%d\n"^ewslatebase); 
mtnew5tate=0; 

aDcsTreeNode *temp jiode=NULL; 

FinalCGLTopl2=(aCoinbGatiss *)inem->Manoc(sizeof (aCoinbGauss)*iumx,true); 
CGLTopl2=(aConibGaiiss ♦)mem->MaUoc(sizeof(aCoinbGauss)*(ninix+l),true); 
TempCGLTopl2=(aCoinbGauss*)mem->Ma]loc(sizeof(aCombGai^ 
CGL=(aCoinbGauss *)mem->Manoc(sizeof(aCoxnbGauss)*tmux''^^ 
idxl=(5nt*)mem->Malloc(sizeof(int)*tunix,true); . 
. idx2=Kint *)mexn->MaUoc(sizeof (int)*nmix,true); 
for (int i«04<ninix+l { 
if(i<ninix){ 

HnalCGLTopl2[ilxooibNfeanKfloat *)mem->Manoc(sizeof(float)*DSvecSize,Jxue); 
ExialCGLTopU[i]xainbVai^(float '^)mem->MaUM^ 

1 •■ ■ 

CGLTopl2[i]xombMean=(float *)inem->Malloc(sizeof(float)*DSvecSize,true); 
CGLTopl2ti].combVar=s(float ♦)infim->Malloc(sizeof(float)*DSvecSize,true); 
TeznpCGLTopl2[i]xombMean=(float *)menb>Malloc(sizeof(float)*DSvecSi2e^trae); 
TeinpCGLTopl2[i].combVarKfloat '>ciem->MaUoc(sizeof(float)'^I%vecSize,t^^ 

1 

for (int i=On<nmix*ninix;i-H-) { 
CGLIil.conibMean=(float *)ineih->Malloc(sizeof(float)*DSvecSize,true); 
CGLIi]xombVar=(float *)mem->Malloc(sizeof(float)*^ 

) 

    for (int i=0;i<mono->monoPhnNum;i++) {
        if (i==DStee) continue;
        for (int j=0;j<3;j++) {
            if (DStrees[i][j]) {
                /* normal models */
                printf("Now we are processing %d exstate's %dth state, start newstate=%d\n",i,j,newstate);
                newstate=VocMergeNodes_ /* name suffix illegible */ (&DStrees[i][0]->root,&DStrees[i][j] /* remainder illegible in source */
                temp_node=VocMergeNodes_Refine(&DStrees[i][0]->root, /* illegible in source */ ->root,magic,temp_node);
            }
        }
    }

    mem->Free(CGL);
    mem->Free(FinalCGLTopl2);
    mem->Free(CGLTopl2);
    mem->Free(TempCGLTopl2);
    mem->Free(idx1);
    mem->Free(idx2);
    return 1;
}

DSizeModel::DSizeModel() {}

void DSizeModel::Initialize(int tee, DcsTree ***trees, int vecSize, int /* name illegible in source */, aHMMDcsTreeNode *dnodes,
                            int nodeNum, bool isbhat) {
    nmix= /* illegible in source */ ;
    DStee=tee;
    bhat=isbhat;
    DSvecSize=vecSize;
    DStrees=trees;
    DSdcsNodes=dnodes;
    DSnodeNum=nodeNum;
}
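
The listing above manipulates combined Gaussians through combMean, combVar and combWeight fields, but the routine that actually combines two Gaussians is not legible in this copy. As a minimal sketch only, assuming the combination is done by moment matching of two weighted diagonal-covariance Gaussians (a common technique, not confirmed by the source), the merge could look as follows. The names CombGauss and MergeGaussians are hypothetical stand-ins, not identifiers from the listing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the listing's aCombGauss record
// (diagonal covariance, one variance per feature dimension).
struct CombGauss {
    float weight;
    std::vector<float> mean;
    std::vector<float> var;
};

// Merge two weighted diagonal Gaussians by moment matching (an assumption,
// see the lead-in): the result has the summed weight and the same first and
// second moments as the two-component mixture it replaces.
CombGauss MergeGaussians(const CombGauss &a, const CombGauss &b) {
    assert(a.mean.size() == b.mean.size());
    assert(a.var.size() == a.mean.size() && b.var.size() == b.mean.size());
    const float w = a.weight + b.weight;
    assert(w > 0.0f);

    CombGauss out;
    out.weight = w;
    out.mean.resize(a.mean.size());
    out.var.resize(a.mean.size());
    for (std::size_t d = 0; d < a.mean.size(); ++d) {
        // Weighted mean of the two component means.
        const float m = (a.weight * a.mean[d] + b.weight * b.mean[d]) / w;
        // Weighted second moment E[x^2] of the two components.
        const float second = (a.weight * (a.var[d] + a.mean[d] * a.mean[d]) +
                              b.weight * (b.var[d] + b.mean[d] * b.mean[d])) / w;
        out.mean[d] = m;
        out.var[d] = second - m * m;  // variance = E[x^2] - mean^2
    }
    return out;
}
```

Merging in this way keeps the total mixture weight unchanged, so a state whose mixtures are reduced pairwise still has weights that sum to the same value as before.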

Thus, a method and system to scale a decision tree-based hidden markov model
(HMM) for speech recognition have been described. The method and system described
are particularly suitable for automatic speech recognition (ASR) systems. In the
foregoing specification, the invention has been described with reference to specific
exemplary embodiments thereof. It will, however, be evident that various modifications
and changes may be made thereto without departing from the broader spirit and scope of
the invention as set forth in the appended claims. The specification and drawings are,
accordingly, to be regarded in an illustrative sense rather than a restrictive sense.

What is claimed is:

1. A speech processing method comprising:

scaling a decision tree-based model for a given task. 

2. The method of claim 1, wherein the decision tree-based model is a decision
tree-based hidden markov model (HMM).

3. The method of claim 1, further comprising:

adapting the scaled decision tree-based model for the given task. 

4. A speech processing system comprising:

a memory to store a decision tree-based model for a given task; and
a processor to scale the decision tree-based model for the given task.

5. The system of claim 4, wherein the decision tree-based model is a decision
tree-based hidden markov model (HMM).

6. The system of claim 4, wherein the processor is to adapt the scaled decision
tree-based model for the given task.

7. A machine-readable medium that provides instructions, which if executed by a
processor, cause the processor to perform the operations comprising:

scaling a decision tree-based model for a given task.

8. The machine-readable medium of claim 7, further providing instructions, which 
if executed by a processor, cause the processor to perform the operations of: 

scaling the decision tree-based model based on a hidden markov model (HMM)
for the given task.

9. The machine-readable medium of claim 7, further providing instructions, which
if executed by a processor, cause the processor to perform the operations of:

adapting the scaled decision tree-based model for the given task.



10. A speech processing method comprising: 
collecting a vocabulary knowledge of a given task; and

trimming down a general model according to the vocabulary knowledge of the
given task.

11. The method of claim 10, further comprising:
adapting the trim-down general model for the given task.

12. The method of claim 11, wherein the adapting the trim-down general model
includes:

collecting adaptation data, the adaptation data being related to the given task;
and

adapting the trim-down general model to a task dependent model using the
adaptation data.

13. The method of claim 12, further comprising:

interpolating the trim-down general model with the task dependent model to
obtain a task specific model.

14. The method of claim 10, wherein the general model is a hidden markov model
(HMM). 

15. A speech processing system comprising: 
a memory to store a general model; and 

a processor to collect a vocabulary knowledge of a given task and to trim down
the general model according to the vocabulary knowledge of the given task.

16. The system of claim 15, wherein the processor is to adapt the trim-down general
model for the given task.

17. The system of claim 16, wherein the processor is to collect adaptation data, the
adaptation data being related to the given task, and adapt the trim-down general
model to a task dependent model using the adaptation data.

18. The system of claim 17, wherein the processor is to interpolate the trim-down
general model with the task dependent model to obtain a task specific model.

19. The system of claim 15, wherein the general model is a hidden markov model 
(HMM). 

20. A machine-readable medium that provides instructions, which if executed by a 
processor, cause the processor to perform the operations comprising: 

collecting a vocabulary knowledge of a given task; and 

trimming down a general model according to the vocabulary knowledge of the
given task.

21. The machine-readable medium of claim 20, further providing instructions,
which if executed by a processor, cause the processor to perform the operations of:

adapting the trim-down general model for the given task.

22. The machine-readable medium of claim 21, further providing instructions,
which if executed by a processor, cause the processor to perform the operations of: 

collecting adaptation data, the adaptation data being related to the given task; 

and 

adapting the trim-down general model to a task dependent model using the
adaptation data.

23. The machine-readable medium of claim 7, further providing instructions, which
if executed by a processor, cause the processor to perform the operations of:

interpolating the trim-down general model with the task dependent model to
obtain a task specific model.
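
Claims 13, 18 and 23 recite interpolating the trim-down general model with the task dependent model, but they do not fix a formula. As a minimal sketch, assuming plain linear interpolation of corresponding model parameters (for example, Gaussian means) with a weight lambda that is not given in the source, the step could be written as below. The function name InterpolateParams and the flat parameter vectors are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linearly interpolate two equally sized parameter vectors.
// lambda = 1 keeps the trim-down general model unchanged;
// lambda = 0 keeps the task dependent model unchanged.
// The interpolation form and the weight are assumptions for illustration.
std::vector<float> InterpolateParams(const std::vector<float> &trimDown,
                                     const std::vector<float> &taskDependent,
                                     float lambda) {
    assert(trimDown.size() == taskDependent.size());
    assert(lambda >= 0.0f && lambda <= 1.0f);

    std::vector<float> taskSpecific(trimDown.size());
    for (std::size_t i = 0; i < trimDown.size(); ++i) {
        taskSpecific[i] = lambda * trimDown[i] + (1.0f - lambda) * taskDependent[i];
    }
    return taskSpecific;
}
```

A small lambda leans on the adaptation data, which may be appropriate when plenty of task data is available; a large lambda stays close to the trim-down general model when adaptation data is scarce.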